Population counting in computer systems.

In an embodiment of the present invention population
counting is performed by using a multiplying unit, in a
computer system, including a plurality of multiplying
sub-units for simultaneously executing partial
multiplications among elements obtained by dividing
multiplicand data and multiplier data in a regular
multiplication mode. In a population counting mode, input
data for the population counting is divided into population
counting elements instead of the multiplier data, and
population countings for the elements are performed
simultaneously by using the multiplying sub-units, producing
partial counted data of the population counting elements,
and the partial counted data are sent to a pair comprising a
carry save adder and carry propagate adder by which a
population counting result for the input data is obtained
and output.

The present invention relates to population counting in
computer systems.

With the development of computer systems 
data processing performed in computer systems 
has come to be executed at high speed. In the field 
of graphic display for example, the variable density 
of a graph has come to be processed rapidly by 
computer systems. That is, variable density is pro- 
cessed at high speed in a computer system by 
counting the number of "1" bits in numerical data,
represented in binary notation, including graphic 
information. Such counting of the number of "1" 
bits is called "population counting" and an instruc- 
tion to carry out such counting is called a 
"population counting instruction". The present in- 
vention relates to the execution of population 
counting in computer systems. 

Population counting has been performed by a 
circuit provided in a computer system exclusively 
for the purpose. With such a dedicated population 
counting circuit, the number of "1" bits in numeri- 
cal data, the data usually consisting of 8 bytes, can 
be counted. However, the counting is performed
one byte at a time, so that it takes a lot of time to
count the "1" bits throughout the numerical data.
Considering only the desirability of increased 
counting speed, the provision of a dedicated count- 
ing circuit which could perform counting two or 
more bytes at a time instead of one byte at a time 
might be contemplated. However, this is not prac- 
tically realizable because a large amount of hard- 
ware (electric parts) would be needed for such a 
dedicated circuit. Thus, the use of a dedicated 
circuit has the drawbacks of increased cost (for the 
circuit) and of low speed (i.e. considerable time is 
needed). 

In computer systems, particularly in recent 
high-speed computer systems such as so-called 
super computers, multiplying units can perform 
multiplication at high speed, handling data in units 
of more than two bytes, using carry save adders 
(CSA) and carry propagate adders (CPA), which 
are adders well-known for use in multiplying units 
in computer systems. Accordingly, if a multiplying 
unit could be used to perform population counting, 
the counting speed of population counting could be 
increased, without the need to provide a dedicated 
population counting circuit. Furthermore, in a com- 
puter system, the multiplying unit is generally not 
so often used and, moreover, population counting 
is not often performed. Accordingly, it can be said 
that the use of a multiplying unit for population 
counting would contribute to increasing the effectiveness
of usage of the multiplying unit, rather than disturbing
the operation of the computer system.

The use of a multiplying unit for population counting has
been attempted by Shoji Nakatani, one of the inventors of
the present invention. This attempt led to a proposal
described in laid-open Japanese Patent Application SHOH
62-209621 (September 14, 1987). However, in SHOH 62-209621,
the multiplying unit used includes only one multiplying
circuit with a spill adder. Therefore, when the multiplier
of multiplication data is divided into a plurality of
elements, the multiplication must be repeated in the
multiplying circuit a number of times equal to the number of
the elements. For example, when the multiplier consists of 8
bytes and is divided into 4 elements, multiplication must be
carried out or repeated four times (in this case, the
multiplicand of the multiplying data is not divided).
Furthermore, the spill adder is needed for compensating
lower digits which appear during repetition of the
multiplication, so as to be carried up to the final
multiplication results.

Generally, there are two types of multiplying unit. For the
first type of multiplying unit, size is considered more
important than counting speed, so that a first-type
multiplying unit usually includes only one multiplying
circuit. The multiplying unit in SHOH 62-209621 is of this
first type. For the second type of multiplying unit,
counting speed is considered more important than size, so
that the second-type multiplying unit includes a plurality
of multiplying circuits (sub-units) operating in parallel.
The proposal of SHOH 62-209621 cannot be applied to the
second type of multiplier.

An embodiment of the present invention can provide for an
increase in the speed of execution of a population counting
instruction given to a computer system including a
multiplying unit of the second type (with multiplying
sub-units).

An embodiment of the present invention can provide for a
decrease in the quantity of (dedicated) electrical parts
provided for executing a population counting instruction in
a computer system.

An embodiment of the present invention can provide for the
performance of population counting with less expense on
computer system fabrication costs.

In an embodiment of the present invention a multiplying unit
of the second type, including a plurality of multiplying
sub-units, is used. Use is made of CSAs and CPAs provided in
the multiplying sub-units, adding only a small amount of
hardware (electrical parts) such as an adder, selectors and
a few logical circuit elements to each multiplying sub-unit.

When multiplication is performed in a multiplying
unit it is usually for the purpose of executing a
program in the computer system. This operating 
state or mode will be called a "regular multiplica- 
tion mode" hereinafter. However, in accordance 
with an embodiment of the present invention, the 
computer system is modified so that the multiply- 
ing unit can operate both for regular multiplication 
and for population counting. This latter operating 
state or mode of the multiplying unit will be called 
a "population counting mode" hereinafter. 

In the population counting mode, numerical 
data for population counting is set in a multiplier 
register in the second-type multiplying unit and
divided into a plurality of elements. The division of 
the multiplier is performed based on the process 
used in the regular multiplication mode; that is, the 
division is performed in consideration of the cal- 
culating form executed in the regular multiplication 
mode and the number of the multiplying sub-units 
provided in the second-type multiplying unit. The 
calculating form is a form for multiplying multipli- 
cand and multiplier given to the multiplying unit. 
Generally, in a second-type multiplying unit, there 
are several calculating forms. For example, accord- 
ing to some calculating forms, multiplication is per- 
formed by multiplying together elements obtained 
by dividing the multiplicand and the multiplier, and 
according to another calculating form, the mul- 
tiplication is performed by multiplying the multipli- 
cand, which is not divided, together with elements 
obtained by dividing the multiplier. 

After the numerical data for population counting 
is divided into a plurality of elements in the multi- 
plier register, the bytes of each element are sent to 
the respective multiplying sub-unit, and the number
of "1" bits in each element is counted by a CSA
newly provided for the respective multiplying sub-unit
for performing population counting, and a half-sum
output (HS) and a half-carry output (HC) concerning
the number of "1" bits in each element
are produced from the newly provided CSA. The
HS and HC outputs are sent to a CSA and a CPA 
which have been (are ordinarily) provided in each 
multiplying sub-unit and added thereby, using the 
well-known Booth's algorithm. 

The counted results of the numbers of "1" bits in the
respective elements are sent from the multiplying sub-units
to a common CSA and a common CPA which also have been (are
ordinarily) provided in the second-type multiplying unit, in
which the counted results from the multiplying sub-units are
added, and the final result of the population counting is
output from the common CPA.

In an embodiment of the present invention, since the
hardware and the multiplying algorithm of the multiplying
sub-units in the second-type multiplying unit can be used
effectively in parallel, population counting can be
performed at high speed, using less hardware.

Reference is made, by way of example, to the
accompanying drawings in which:-

Fig. 1 is a block diagram of a population counting circuit
of the prior art provided in a computer system;

Fig. 2 illustrates 8-byte input and output data relating to
a population counting instruction;

Fig. 3 is a circuit diagram of a population counting circuit
of the prior art;

Fig. 4 is a block diagram of a first embodiment of the
present invention;

Fig. 5 is a circuit diagram illustrating a population
counting mode of the first embodiment;

Fig. 6 is a circuit diagram of a first selector and a part
of a multiple generator;

Fig. 7 is a circuit diagram of a second selector and a part
of a multiple generator;

Fig. 8 is a schematic chart illustrating a method of
addition for population counting in a sub-unit;

Fig. 9 is a schematic chart illustrating a method of
addition of the number of "1" outputs from the four
multiplying units in the first embodiment;

Fig. 10 is a block diagram of a second embodiment of the
present invention;

Fig. 11 is a schematic chart illustrating a method of
addition in a sub-unit to obtain a full sum and full carry;
and

Fig. 12 is a schematic chart illustrating a method of
addition of the number of "1"s output from the four
multiplying units in the second embodiment.

Before describing embodiments of the present invention, a
prior art dedicated population counting circuit and a prior
art proposal for a first-type multiplying unit capable of
performing population counting will be briefly explained
with reference to Figs. 1 to 3.

Fig. 1 is a block diagram of a dedicated population counting
circuit provided for a computer system. In the dedicated
population counting circuit, population counting is
performed as follows: numerical data (consisting of 8 bytes,
in binary notation) for the population counting is given to
a register (REG) 50; the 8-byte data is transferred to a
further register REG 52 through a selector 51, and a byte
constituting the lowest units of the 8-byte data in the REG
52, which will be called the lowest byte in the REG 52
hereinafter, is sent to an operation circuit 53 in which the
number of "1" bits is counted and converted to a binary
numeral and sent to a CPA 54. In the CPA 54, the number of
"1" bits in the lowest byte in the REG 52 is counted and
stored in an intermediate register REG 55. During the above
step, the 8-byte data in the REG 52 is shifted to the right
so that the next-lowest byte, from the point of view of the
lowest byte treated in the above step, is set at the lowest
byte position. Then the number of "1" bits in this
next-lowest byte is counted by the same process as indicated
above and the counted result for this next-lowest byte is
added to the counted result for the previous lowest byte. In
the CPA 54, counting of the number of "1" bits in respective
bytes is repeated for every byte and the counted results are
added. The output from the CPA 54 is sent to a result
register REG 56 from which the number of "1" bits in the
8-byte numerical data is output as shown in Fig. 2. Thus, in
the prior art dedicated population counter circuit, the
count of "1" bits has been performed by repeating eight
times the process of counting the number of "1" bits in one
byte. This results in considerable waste of time. The
counting, theoretically, could be performed two or more
bytes at a time, but in practice this is completely
unrealistic from the viewpoint of hardware costs.
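
The byte-at-a-time procedure of Fig. 1 can be pictured with
the following minimal sketch. It is a software illustration
only, not the circuit itself; the function name and the
variable names (which borrow the reference numerals of
Fig. 1) are hypothetical.

```python
def popcount_byte_serial(data: int) -> int:
    """Count "1" bits in an 8-byte value, one byte per step (Fig. 1 style)."""
    reg52 = data & 0xFFFFFFFFFFFFFFFF  # working register holding the 8-byte data
    reg55 = 0                          # intermediate register (running total)
    for _ in range(8):                 # eight passes, one per byte
        lowest_byte = reg52 & 0xFF
        ones = bin(lowest_byte).count("1")  # role of the operation circuit 53
        reg55 = reg55 + ones                # role of the CPA 54
        reg52 >>= 8                         # shift the next-lowest byte into place
    return reg55
```

The loop makes the cost visible: eight sequential passes are
needed regardless of the data.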

The use of a multiplying unit for population 
counting has been attempted in accordance with a 
proposal as shown in Fig. 3. However, in this 
proposal, the multiplying unit is a first-type mul- 
tiplying unit including only one multiplying circuit 
with a spill adder, so that there is still a problem of 
poor counting speed as indicated below. 

When the first-type multiplying unit shown in Fig. 3
operates in the regular multiplication mode, it operates as
follows: multiplication data including the multiplicand and
the multiplier, each consisting of 8 binary notation bytes
for example, is set in a vector register (VR) 1; the
multiplicand in the VR 1 is transferred to a multiplicand
register REG (CAND REG) 2a through a register REG 1a; the
multiplier in the VR 1 is set to a register REG 1b and
divided into four elements each consisting of 2 bytes (16
bits); each element of 2 bytes is set to a decoder (DCDR) 3
in which the element is decoded into nine kinds of shift
control signals, based on the well-known Booth's algorithm,
wherein the nine kinds of shift control signals will be
called "decoded signals" hereinafter; the decoded signals
from the DCDR 3 are set to a multiplier register REG 2b as
the multiplier for the multiplicand set in the CAND REG 2a;
the multiplicand in the CAND REG 2a and the decoded signals
in the multiplier REG 2b are sent to a multiple generator
(MG) 4 in which the multiplicand is shifted as far as
indicated (as far as the numerals designated) by the decoded
signals, this generation in the MG 4 being called multiple
generation; the shifted multiplicands produced from the MG 4
are sent to a first CSA (CSA(1)) 50 and a second CSA
(CSA(2)) 51 in which the shifted multiplicands are added,
producing an intermediate sum and an intermediate carry of
the products of the multiplicand and the relevant element of
the multiplier at registers REG 6a and 6b respectively; the
above process is repeated four times for obtaining the
products of the multiplicand and the four elements; the
outputs of the REGs 6a and 6b are sent to a first CPA
(CPA(1)) 7 in which the four results concerning the four
elements are added, producing the total number of the "1"
bits; and the total number, namely a multiplication result,
is set in a result register REG (ZR) 8.
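
The decoding step can be illustrated with a small sketch.
The text states only that each 2-byte element is decoded
into nine kinds of shift control signals based on Booth's
algorithm; the radix-4 recoding used below is an assumption,
chosen because it yields exactly nine digits for a 16-bit
unsigned element. The function names are hypothetical.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list[int]:
    """Recode an unsigned multiplier into radix-4 Booth digits in {-2, ..., 2}."""
    digits = []
    prev = 0  # implicit zero bit to the right of bit 0
    # width/2 + 1 overlapping groups are needed for an unsigned operand
    for i in range(width // 2 + 1):
        b0 = (multiplier >> (2 * i)) & 1
        b1 = (multiplier >> (2 * i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, width: int) -> int:
    """Sum the shifted multiples selected by the Booth digits."""
    total = 0
    for i, d in enumerate(booth_radix4_digits(multiplier, width)):
        # each digit selects 0, +/-1x or +/-2x the multiplicand (multiple generation)
        total += (d * multiplicand) << (2 * i)
    return total
```

Each digit corresponds to one shifted multiple of the
multiplicand, so a 16-bit element gives nine rows to be
added by the carry save adders.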

When the regular multiplication mode is changed to the
population counting mode in the first-type multiplying unit
shown in Fig. 3, the first-type multiplying unit performs
population counting as follows: the numerical data (for
example consisting of 8 binary notation bytes) for the
population counting is set to the VR 1; the numerical data
for the population counting is transferred to the REG 1b and
divided into four elements each consisting of 2 bytes; each
2-byte element is selected from the lowest element by a
first selector (SEL(1)) 1c, provided to the first-type
multiplying unit, so that the lowest element is sent to a
fourth CSA (CSA(4)) 12, provided to the first-type
multiplying unit, in which the number of "1" bits in each
element is counted, producing the half sum (HS) of "1" bits
in the element and the half carry (HC) produced during
processing to produce the HS, at an HS register REG 12b and
an HC register REG 12a respectively; the HS and HC
respectively set to the HS REG 12b and the HC REG 12a are
sent to a second selector (SEL(2)) 41, newly provided in the
first-type multiplying unit, in which the HS and HC are
selected so as to be sent to the CSA(1) 50 and CSA(2) 51,
suppressing the output of the MG 4 so that it is not sent to
the CSA(1) 50; then the numbers of "1" bits in all four
elements are added, using the hardware and the Booth's
algorithm of the CSA(1) and the CSA(2) repeatedly, four
times, and also using a spill adder (SPA) 11 for
compensating raised carry in the low units omitted during
the operation of the CSA(1) 50 and CSA(2) 51; and the result
of population counting of the given numerical data is output
at the ZR 8.

As stated above, when the first-type multiplying unit is
used for performing population counting, CSA(1) and CSA(2)
are used, repeating their operation as many times as there
are divided elements, which results in considerable wastage
in counting up all "1" bits of the numerical data.

For exemplifying the present invention, first and second
embodiments will be explained, using two kinds of
second-type multiplying unit each including four multiplying
sub-units, referring to Figs. 4 to 9 and Figs. 10 to 12
respectively. In each embodiment, the multiplicand and the
multiplier consist of 8 bytes respectively.

In the first embodiment, the second-type multiplying unit
operates, in the regular multiplication mode, under a
calculating form such that the multiplicand and the
multiplier are divided each into two elements, so that each
element consists of 4 bytes. The multiplicand is divided
into an upper multiplicand element (CU) and a lower
multiplicand element (CL) and the multiplier is divided into
an upper multiplier element (IU) and a lower multiplier
element (IL). Then the regular multiplication is performed
by multiplying the elements with each other like CU x IL, CL
x IL, CU x IU, CL x IU, using the four multiplying sub-units
respectively, and the multiplied results from the
multiplying sub-units are added by a carry save adder (a
second CSA (CSA(2'))) and a carry propagate adder (a second
CPA (CPA(2'))).

Fig. 4 is a block diagram of a second-type multiplying unit
as used in the first embodiment. In Fig. 4, the same
reference symbols or numerals as in Fig. 3 designate the
same or similar functions or parts as in Fig. 3. In Fig. 4,
when the second-type multiplying unit operates in the
regular multiplication mode, the multiplications of CL x IU,
CU x IL, CU x IU and CL x IL are performed by a multiplying
sub-unit A, which is simply called a "sub-unit A"
hereinafter, a sub-unit B, a sub-unit C and a sub-unit D
respectively. In the population counting mode, however, the
sub-units A and B operate in the population counting mode
and sub-units C and D operate in the regular multiplication
mode. Therefore, only the internal block diagram forms of
the multiplying circuits of sub-units A and C are shown in
Fig. 4, leaving the other sub-units B and D blank except for
registers at the input and output of the sub-units.

Regular multiplication is performed as follows: numerical
data for performing the multiplication is set to VR 1; from
VR 1, the 8-byte multiplicand and the 8-byte multiplier are
sent to the multiplicand REG 1a and the multiplier REG 1b
respectively; the multiplicand in the REG 1a is divided into
CU data and CL data and the multiplier in the REG 1b is
divided into IU data and IL data so that each element
consists of 4 bytes; the CL data in the REG 1a and the IU
data in the REG 1b are set to REG 2a and REG 2b in the
sub-unit A respectively; in the sub-unit A, the IU data set
in the REG 2b is sent to a decoder (DCDR) 3 in which the
decoded signals obtained from the IU data are produced and
sent to an MG 4; meanwhile, the CL data set in the REG 2a is
also sent to the MG 4 in which the multiple generation is
performed with the CL data and the decoded signals of the IU
data; the output data from the MG 4 is sent to a first CSA
(CSA(1')) 5 and a first CPA (CPA(1')) 6, in which the output
data from the MG 4 are added in accordance with Booth's
algorithm, producing the partial product CL x IU at a result
register REG 7a; in the sub-units B, C and D, similar
operations to those performed in the sub-unit A are
performed respectively, producing the partial products CU x
IL, CU x IU and CL x IL respectively; these partial products
are sent to a second CSA (CSA(2')) 8 and a second CPA
(CPA(2')) 9 by which the final result of the regular
multiplication is obtained; and the final result is output
to a result REG 11 through a post shifter 10 for
normalization. Thus, in the second-type multiplying unit,
regular multiplication can be performed by making the four
sub-units operate at the same time, which results in a
shortening of operation time, compared with the operation
time of the first-type multiplying unit.
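
The division of an 8-byte by 8-byte multiplication into four
4-byte by 4-byte partial products can be checked with the
sketch below. It models only the arithmetic, not the
Booth-based hardware, and treats the operands as unsigned
integers, which is an assumption; the function name is
hypothetical.

```python
MASK32 = 0xFFFFFFFF


def multiply_with_four_subunits(cand: int, ier: int) -> int:
    """Form an 8-byte x 8-byte product from four 4-byte x 4-byte partial products."""
    cu, cl = (cand >> 32) & MASK32, cand & MASK32  # upper / lower multiplicand elements
    iu, il = (ier >> 32) & MASK32, ier & MASK32    # upper / lower multiplier elements
    a = cl * iu  # sub-unit A
    b = cu * il  # sub-unit B
    c = cu * iu  # sub-unit C
    d = cl * il  # sub-unit D
    # role of CSA(2') and CPA(2'): add the partial products at their weights
    return (c << 64) + ((a + b) << 32) + d
```

Because the four partial products do not depend on one
another, the sub-units can compute them at the same time.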

When population counting is performed using the second-type
multiplying unit shown in Fig. 4, the mode of the
second-type multiplying unit is changed to the population
counting mode. In this mode the 8-byte numerical data for
population counting, which will be called the "input 8-byte
data" hereinafter, is given to the VR 1, and the input
8-byte data is set to the REG 1b in which the input 8-byte
data is equally divided into two elements called IU data and
IL data, each consisting of 4 bytes. The IU data is set to
REGs 2b and 2f in the sub-units A and C respectively, and
the IL data is set to the REGs 2d and 2h in the sub-units B
and D respectively. In the sub-unit A, the IU data is sent
to a third CSA (CSA(3')) 12 composed of sixteen half adders
12-0, 12-1, ..., 12-14 and 12-15, by which sixteen HC
signals HC00, HC01, HC02, ..., HC14 and HC15 and sixteen HS
signals HS00, HS01, HS02, ..., HS14 and HS15 are produced
and sent to a selector 41 composed of a first selector
(SEL(1)) 41a and a second selector (SEL(2)) 41b, as shown in
Fig. 4 and, in detail, in Fig. 5. The selected data from the
selector 41 are sent to the CSA(1') 5 having seventeen
inputs and six steps of addition circuits. The output from
the CSA(1') 5 is sent to the CPA(1') 6 in which a carry and
a sum output from the CSA(1') 5 are added. The results of
the addition obtained by the CPA(1') 6 are set in the REG
7a.

The REG 2a has a function of outputting multiplicand bit
signals and inverted signals of the multiplicand bit signals
in the regular multiplication mode. The output signals from
the REG 2a are shown in Fig. 5, and in the output signals
from the REG 2a, a plus signal such as +R2-31 indicates a
regular bit signal at the 31st bit position of the REG 2a
and a minus signal such as -R2-31 indicates the inverted
signal of the bit signal +R2-31.

Fig. 5 is a circuit diagram showing the circuit connections
among the CSA(3') 12, the SEL(1) 41a, the SEL(2) 41b, the
DCDR 3, the MG 4 and the CSA(1') 5. In Fig. 5, the same
reference symbols, or numbers, as in Fig. 4 designate the
same or similar units or parts as in Fig. 4. The REG 2b,
which is not depicted in Fig. 5, has 32 bit-positions for
setting the 4-byte IU data, and the bit-signals set in the
32 bit-positions are indicated by +R3-0, +R3-1, +R3-2, ...,
+R3-30 and +R3-31. In the population counting mode, the
bit-signals +R3-0 to +R3-31 set in the REG 2b are sent to
the CSA(3') 12 including sixteen half adders (HAs) 12-0,
12-1, 12-2, ..., 12-14 and 12-15. Two bit-signals set in
bit-positions (of the REG 2b) adjacent to each other are
sent to one of the sixteen HAs for performing the half
addition of the two bit-signals. For example, the bit
signals +R3-0 and +R3-1 set in the bit positions 0 and 1,
adjacent to each other, in the REG 2b are sent to an HA 12-0
in the CSA(3') 12. In each HA, a half sum (HS) signal and a
half carry (HC) signal are produced, so that 16 pairs of HS
and HC signals are output from the CSA(3') 12 and sent to
the SEL(2) 41b and the SEL(1) 41a respectively. For example,
a pair of signals +HS00 and +HC00 is output from the HA 12-0
and sent to the SEL(2) 41b and the SEL(1) 41a respectively
as shown in Fig. 5.
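
The half-adder stage described above can be sketched as
follows. This is a minimal software model, assuming that bit
position 0 of the REG 2b is the most significant bit (the
text does not state the numbering direction, and the count
does not depend on it); the function name is hypothetical.

```python
def half_adder_stage(element: int) -> tuple[list[int], list[int]]:
    """Feed 16 adjacent bit pairs of a 32-bit element through 16 half adders."""
    hs, hc = [], []
    for k in range(16):
        # pair (+R3-2k, +R3-(2k+1)); position 0 is assumed to be the top bit
        a = (element >> (31 - 2 * k)) & 1
        b = (element >> (30 - 2 * k)) & 1
        hs.append(a ^ b)  # half sum HSkk, weight 1
        hc.append(a & b)  # half carry HCkk, weight 2
    return hs, hc


def count_from_half_adders(element: int) -> int:
    """Recover the number of "1" bits from the 16 HS/HC pairs."""
    hs, hc = half_adder_stage(element)
    return sum(hs) + 2 * sum(hc)
```

Each pair contributes 0, 1 or 2 to the count, encoded as the
two-bit value (HC, HS), so the remaining work is a
multi-operand addition of sixteen small numbers.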

All 65 decoded signals +G1-POS1, +G1-NEG1, +G1-POS2,
+G1-NEG2, ..., +G16-POS2, +G16-NEG2 and +G17-POS1 output
from the DCDR 3 are set to bit "0" in the population
counting mode. Accordingly, in the population counting mode,
the input signals to the MG 4 are all set to bit "0", so
that the output signals from the MG 4 also become bit "0" as
seen from Figs. 6 and 7. Fig. 6 is a block diagram
illustrating the wiring connection between the MG 4 and the
SEL 41a, whilst Fig. 7 illustrates the wiring connection
between the MG 4 and the SEL 41b. In Fig. 7, the same
reference symbol or number as in Figs. 5 or 6 designates the
same unit or signal as in Figs. 5 or 6. As shown in Figs. 6
and 7, the output signals +G2-30, +G3-30, ..., +G16-30 and
+G17-30 from the MG 4 are sent to the SEL(1) 41a, the output
signals +G2-31, +G3-31, ..., +G16-31 and +G17-31 from the MG
4 are sent to the SEL(2) 41b, and the other output signals
from the MG 4 are directly sent to the CSA(1') 5; wherein
the numbers 30 and 31 indicate the bit positions, which will
be explained later with reference to Fig. 8, in the CSA(1').
The output signals, each having the number 30, from the MG 4
are suppressed by AND circuits in the SEL(1) 41a in the
population counting mode, so that only the output signals
+HC00, +HC01, ..., +HC14 and +HC15 from the CSA(3') 12 are
output from the SEL(1) 41a as the input signals +G2-30-S,
+G3-30-S, ..., +G16-30-S and +G17-30-S to the CSA(1') 5. In
the same way, the output signals, each having the number 31,
from the MG 4 are suppressed by AND circuits in the SEL(2)
41b in the population counting mode, so that only the output
signals +HS00, +HS01, ..., +HS14 and +HS15 from the CSA(3')
12 are output from the SEL(2) 41b as the input signals
+G2-31-S, +G3-31-S, ..., +G16-31-S and +G17-31-S to the
CSA(1') 5.

Meanwhile, in the regular multiplication mode, the output
signals from the CSA(3') 12 are suppressed at the SEL(1) and
the SEL(2), and the output signals from the MG 4 are sent to
the CSA(1') directly and through the SEL(1) 41a and the
SEL(2) 41b as seen from Figs. 5, 6 and 7.
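
The behaviour of one selector bit in the two modes can be
summarized by the following sketch. It is a logical model
only, assuming a simple AND-OR gating per bit; the actual
gate arrangement of Figs. 6 and 7 is not reproduced here,
and the function and parameter names are hypothetical.

```python
def selector_bit(mg_bit: int, csa3_bit: int, popcount_mode: bool) -> int:
    """Pass the MG output in regular mode, the HS/HC signal in counting mode."""
    mode = 1 if popcount_mode else 0
    from_mg = mg_bit & (1 - mode)  # MG path is suppressed in counting mode
    from_csa3 = csa3_bit & mode    # CSA(3') path is suppressed in regular mode
    return from_mg | from_csa3
```

Only the signals at bit positions 30 and 31 of each row pass
through such a selector; all other MG outputs go to the
CSA(1') directly.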

Again in the population counting mode, the signals relating
to the HS and HC signals of the IU data are input to the
CSA(1') 5 in which the input signals each having the number
30, for example +G2-30-S, and the number 31, for example
+G2-31-S, are set at a definite bit position of sixteen bit
rows described in Fig. 8.

Fig. 8 is a chart showing schematically a way of addition
used for the multiplication in the CSA(1'). The chart
corresponds to a partial product of 4 bytes x 4 bytes
performed in the regular multiplication mode. A numeral of
32 bits is set in each row, which will be called a "bit row"
hereinafter, in the regular multiplication mode; however, in
the population counting mode, bits "0" are imposed at all
positions except the hatched positions because all the input
signals to the CSA(1') 5 from the MG 4 are set to bit "0" in
the population counting mode as stated before.

For example, the input carry signal +G2-30-S to the CSA(1')
5 is set in the bit row G2 at a bit position corresponding
to the 30th bit position in a 64-bit carry numeral line
depicted at the bottom in Fig. 8; the input sum signal
+G2-31-S to the CSA(1') 5, related to the carry signal
+G2-30-S, is set in the bit row G2 at a bit position
corresponding to the 31st bit position in the 64-bit numeral
line depicted above; the input carry signal +G3-30-S to the
CSA(1') 5 is set in the bit row G3 at a bit position
corresponding to the 30th position in the 64-bit numeral
line; the input sum signal +G3-31-S to the CSA(1') 5 is set
in the bit row G3 at the 31st bit position in the 64-bit
numeral line; and so on. Accordingly, the carry and sum data
respectively set at the 30th and 31st positions of each bit
row are vertically lined up. As seen from Fig. 8, the G1 bit
row is not used in the population counting mode.

The bit numerals of the sum and carry set in the G2 to G17
bit rows are added in the CSA(1') 5 and CPA(1') 6. The added
result is set at the 26th to 31st bit positions, which are
hatched, in the 64-bit numeral line at the bottom of the
chart. The result represents the number of "1" bits in the
4-byte IU data set in the REG 2b in the sub-unit A. The
result is sent to the REG 7a.
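
The column addition of Fig. 8 can be imitated in software as
below. This is a sketch under stated assumptions: each of
the sixteen rows G2 to G17 is modelled as a small integer
whose bit of weight 2 is the HC signal and whose bit of
weight 1 is the HS signal, and the CSA(1') is modelled as
repeated 3-to-2 carry-save compression. The grouping of rows
inside the real CSA(1') is not given in the text, and all
names are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3-to-2 compressor: x + y + z == sum_word + carry_word."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def reduce_rows(rows: list[int]) -> tuple[int, int, int]:
    """Compress rows to a sum word and a carry word; also report the level count."""
    rows = list(rows)
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3  # rows that form complete triples
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(carry_save_add(*rows[i:i + 3]))
        nxt.extend(rows[full:])           # leftover rows pass to the next level
        rows = nxt
        levels += 1
    rows += [0] * (2 - len(rows))
    return rows[0], rows[1], levels


def popcount32_carry_save(element: int) -> int:
    """Count "1" bits of a 32-bit element via half adders, CSA tree and final add."""
    rows = []
    for k in range(16):
        a = (element >> (2 * k)) & 1
        b = (element >> (2 * k + 1)) & 1
        rows.append(((a & b) << 1) | (a ^ b))  # (HC, HS) as one two-bit row
    sum_word, carry_word, _ = reduce_rows(rows)
    return sum_word + carry_word               # role of the CPA(1')
```

With this simple grouping, sixteen rows are reduced in six
levels, and the result (0 to 32) fits in six bits, in line
with the six hatched result positions 26 to 31.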

Since the input 8-byte data set in the REG 1b is equally
divided into two elements, two sub-units are enough to
perform the population counting. Therefore, in this
embodiment, the sub-units A and B are used in both modes,
the population counting mode and the regular multiplication
mode, and the other sub-units C and D are used only in the
regular multiplication mode. Accordingly, the hardware and
the function of the sub-unit B are the same as those of the
sub-unit A, and the hardware and the function of the
sub-units C and D are different from those of the sub-units
A and B.

The sub-units C and D have the same function and hardware as
each other, except that the multiplicand and the multiplier
in the sub-units are different. The sub-unit C has the
function of performing the regular multiplication by
multiplying the CU data and the IU data in the regular
multiplication mode and producing all bits "0" in the
population counting mode. Therefore, the sub-unit C has
hardware such as a REG 2e having the same function as the
REG 2a in the sub-unit A, but no CSA(3') such as the CSA(3')
12 in the sub-unit A and no SEL such as the SEL 41 in the
sub-unit A. As mentioned above, since the REG 2e has the
same function as the REG 2a in the sub-unit A, the regular
CU data are output from the REG 2e in the regular
multiplication mode, and all bit "0" signals are output from
the REG 2e in the population counting mode, so that all bit
"0" signals are output from a REG 7c to the CSA(2') in the
population counting mode. The block diagram for the sub-unit
C is depicted in Fig. 4. Since the block diagram for the
sub-unit D is equal to that for the sub-unit C, the sub-unit
D block diagram is omitted from Fig. 4.

In the sub-unit B, in the population counting mode, the
added result is likewise set at the 26th to 31st bit
positions, which are hatched, in the 64-bit numeral line at
the bottom of the chart. Here, the IL data is sent to the
REG 2d in the sub-unit B from the REG 1b, as seen from
Fig. 4.

The two results output from sub-units A and B are added by
the CSA(2') 8 as shown in Fig. 4. The output of the CSA(2')
8 is sent to the CPA(2') 9 and added therein. The result of
the CPA(2') 9 is post-shifted by the post shifter 10 and set
in the REG 11, setting the result data to the positions for
the upper 8 bytes.

Fig. 9 illustrates how the operating results of
the four sub-units A, B, C and D are added by the
CSA(2') 8 and the CPA(2') 9. A symbol "R2 CAND"
indicates the multiplicand consisting of the CU
data and the CL data set in the REG(R2) 1a, and a
symbol "R3 IER" indicates the multiplier consisting
of the IU data and the IL data set in the REG(R3)
1b. In the regular multiplication mode, the
addition of the partial products CLxIL, CUxIL,
CLxIU and CUxIU is performed by the CSA(2') 8 and
the CPA(2') 9 as shown in Fig. 9. Here, the partial
products CLxIL, CUxIL, CLxIU and CUxIU are obtained
from sub-units D, B, A and C respectively. However,
in the population counting mode, the partial
products are obtained only from the sub-units A and
B and, furthermore, the "1" bit counting results of
the IU data, obtained by the sub-unit A, and those
of the IL data, obtained by the sub-unit B, are
both in the same bit positions, as depicted by the
hatched portions in Fig. 9. Therefore, the result
of the addition can be obtained by simply adding
the hatched portions indicated by IL and IU, using
the CSA(2') 8 and the CPA(2') 9 as in the regular
multiplication mode. The data included in the upper
8-byte positions are sent to the REG 11 through
the post SFT 10.
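
The regular-mode addition of Fig. 9 can be illustrated by a short
sketch. It is only an illustration under stated assumptions: CU/CL
and IU/IL are taken to be the upper and lower 32-bit halves of
64-bit operands, the function name multiply_by_parts is
hypothetical, and ordinary integer addition stands in for the
CSA(2') 8 and CPA(2') 9 hardware.

```python
MASK32 = (1 << 32) - 1

def multiply_by_parts(cand: int, ier: int) -> int:
    """Multiply two 64-bit values from four 32x32-bit partial products."""
    cu, cl = cand >> 32, cand & MASK32   # multiplicand halves (CU, CL)
    iu, il = ier >> 32, ier & MASK32     # multiplier halves (IU, IL)
    # Partial products, labelled with the sub-unit named in the text.
    p_d = cl * il   # sub-unit D: CLxIL, weight 2**0
    p_b = cu * il   # sub-unit B: CUxIL, weight 2**32
    p_a = cl * iu   # sub-unit A: CLxIU, weight 2**32
    p_c = cu * iu   # sub-unit C: CUxIU, weight 2**64
    # Plain addition stands in for the CSA/CPA pair (an assumption of
    # this sketch, not a model of the adder hardware).
    return p_d + ((p_b + p_a) << 32) + (p_c << 64)
```

In this sketch the outputs of sub-units A and B carry the same
weight, which is consistent with the statement that their results
occupy the same bit positions in the population counting mode.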

The execution of the population counting
instruction is summarized as follows:

1) the input 8-byte data for the population
counting is set in the REG 1b from the VR 1;

2) the upper 4-byte data (IU data) of the
input 8-byte data set in the REG 1b are set in the
REG 2b of the sub-unit A, and the lower 4-byte
data (IL data) of the input 8-byte data in the REG
1b are set in the REG 2d of the sub-unit B;

3) the divided 4-byte (32-bit) data (IU and IL
data) are further divided into 16 pairs of two bits,
and a 16-bit sum and carry are obtained by 16 half
adders, suppressing the route from the REG 2b to
the DCDR 3;

4) the output of the half adders is input to
the CSA(1') 5 through the selector 41;

5) the number of "1" bits in the IU data is
obtained by addition performed by CSA(1') 5 and
CPA(1') 6 in the sub-unit A;

6) the number of "1" bits in the IL data is
obtained, in the same way as in the sub-unit A, in
the sub-unit B at the same time;

7) the number of "1" bits in the IU data and
the number in the IL data are set in the REG 7a in
the sub-unit A and the REG 7b in the sub-unit B;
and

8) the data in the REGs 7a and 7b are
added by CSA(2') 8 and CPA(2') 9, taking the
weights of respective bits into account.
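
The steps summarized above can be sketched in software as follows.
This is a minimal sketch, not the circuit itself: the function
names are hypothetical, ordinary integer addition stands in for the
carry save and carry propagate adders, and the bit positions of the
registers are not modelled.

```python
MASK32 = (1 << 32) - 1

def half_adder_pairs(word32: int):
    """Step 3: split 32 bits into 16 adjacent pairs, one half adder each."""
    sums, carries = [], []
    for k in range(16):
        a = (word32 >> (2 * k)) & 1
        b = (word32 >> (2 * k + 1)) & 1
        sums.append(a ^ b)      # half sum
        carries.append(a & b)   # half carry
    return sums, carries

def count_ones_32(word32: int) -> int:
    """Steps 4 and 5: add the half-adder outputs.

    A half carry has weight 2 and a half sum has weight 1.
    """
    sums, carries = half_adder_pairs(word32)
    return sum(sums) + 2 * sum(carries)

def population_count_64(data: int) -> int:
    """Steps 1, 2 and 6 to 8: count both halves and add the results."""
    iu = (data >> 32) & MASK32   # upper 4 bytes (IU data), sub-unit A
    il = data & MASK32           # lower 4 bytes (IL data), sub-unit B
    return count_ones_32(iu) + count_ones_32(il)
```

For example, population_count_64(0xFF) evaluates to 8 in this
sketch, since the four lowest bit pairs each produce a half carry
of weight 2.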

Next, the second embodiment of the present
invention will be explained.

Fig. 10 shows a block diagram of a second
multiplying unit as employed in the second
embodiment of the present invention, which includes
four multiplying sub-units 16-A, 16-B, 16-C and
16-D having the same constitution as each other and
operating in accordance with a calculating form
different from that in the case of the first
embodiment. In Fig. 10, the multiplicand and the
multiplier are stored in registers REGs 14 and 15
respectively, and the outputs of the four sub-units
are added by a CSA 17 and a CPA 18 and sent to a
register REG 20 through a post shifter SFT 19.
Only the sub-unit 16-A will be explained because
the sub-units 16-B, 16-C and 16-D are the same as 
the sub-unit 16-A as to their construction and func- 
tion. 

In the second embodiment, the 8-byte multiplier
is divided into four 2-byte elements which are
sent to the sub-units 16-A, 16-B, 16-C and 16-D
respectively. The operation for multiplication and
population counting in the sub-unit 16-A is
essentially the same as in the sub-unit A of the
first embodiment, except that the data set in the
register REG 21 and in the register REG 22 are
8-byte and 2-byte data respectively.

In the population counting mode, an 8-byte
multiplicand stored in the register REG 14 is sent
to a register REG 21 in the sub-unit 16-A and to
three other registers, having the same function as
the REG 21, in the three sub-units 16-B, 16-C and
16-D respectively. Meanwhile, the 8-byte input data
for population counting is stored in the REG 15
once instead of the 8-byte multiplier and equally
divided into four elements each consisting of
2-byte data for population counting. Each 2-byte
data is sent to a register REG 22 in the sub-unit
16-A and to three other registers, having the same
function as REG 22, in the sub-units 16-B, 16-C and
16-D. The 2-byte data set in the REG 22 is sent to
a third CSA (CSA(3")) 27. A half carry (HC) 27a
and a half sum (HS) 27b output from the CSA(3")
27 are sent to a first CSA (CSA(1")) 25, having nine
input terminals and four steps for addition, through
a SEL 32. A sum and carry output from the
CSA(1") 25 are added by a first CPA (CPA(1")) 26.
The result of the addition from the CPA(1") 26 is
set in an REG 30-A.

The same operation as in the sub-unit 16-A is
executed in each of the sub-units 16-B, 16-C and
16-D simultaneously. The four results obtained by
the sub-units 16-A, 16-B, 16-C and 16-D are added
by a second CSA (CSA(2")) 17 and a second CPA
(CPA(2")) 18 to obtain a total result for the
8-byte input data. The output from the CPA(2")
18 is set in a register REG 20 through a post
shifter 19.

Fig. 11 shows schematically a way of addition
in the CSA(1") 25 in the sub-unit 16-A to obtain the
full sum and the full carry. In the sub-unit 16-A,
the bit signal of carry through a first selector, which
is a part of the SEL 32 and not depicted in Fig. 10,
and the bit signal of sum through a second selector,
which is another part of the SEL 32, are input to
a terminal G2, which is not depicted, of the
CSA(1") 25 and occupy the 48th and 49th bit
positions of a 64-bit numeral row, respectively. The
similar bit signals input to a terminal G3 of the
CSA(1") 25 occupy the 50th and 51st bit positions
and so on. Those input to a terminal G9 of the
CSA(1") 25 occupy the 62nd and 63rd bit positions.
In Fig. 11, the same additions in the sub-units
16-B, 16-C and 16-D are indicated together.

The results of addition of the carry and sum by
the CSA(1") 25 are in the positions from 59th to
63rd, as shown at the bottom of the chart. In the
same way, the positions of the data of the carry and
sum in the sub-units 16-B, 16-C and 16-D are from
43rd to 47th, from 27th to 31st and from 11th to
15th respectively, as shown at the bottom of the
chart in Fig. 11. The full sum and full carry
obtained in the CSA(1") 25, shown at the bottom, are
added by the CPA(1") 26 to obtain the number of
"1"s present in the first quarter part of the
multiplier. Then, the data is set in the REG 30-A.

Fig. 12 shows schematically a way of addition
in the CSA(2") 17 and CPA(2") 18 in order to
obtain the total number of "1" bits present in the
multiplier. The data from the four REGs 30-A,
30-B, 30-C and 30-D are added as an addition of
partial products. The data from the REGs 30-A,
30-B, 30-C and 30-D are 10 bytes each. The number
of "1"s present in a quarter of the multiplier in
the REG 15 is set in the hatched bits of each data.
These four data are vertically lined up in four
parallel rows shifted by 2 bytes as shown in
Fig. 12. So, the resultant data is 16 bytes.
Discarding the lower 8 bytes, the upper half of the
16 bytes gives the 8-byte resultant data, in which
the total number of "1" bits present in the
multiplier is set in the last seven bits.
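
The second embodiment can be sketched in the same spirit. The
following is a simplified illustration only: the function names
are hypothetical, integer addition stands in for the CSA/CPA
pairs, and the bit positions and 2-byte row shifts of Fig. 11 and
Fig. 12 are not modelled. It shows why five bits suffice for each
2-byte element and seven bits for the total.

```python
def count_ones_16(element: int) -> int:
    """Count the "1" bits of one 2-byte element via 8 half adders."""
    total = 0
    for k in range(8):
        a = (element >> (2 * k)) & 1
        b = (element >> (2 * k + 1)) & 1
        total += (a ^ b) + 2 * (a & b)   # half sum + weighted half carry
    return total

def population_count_4x16(data: int) -> int:
    """Split 8-byte data into four 2-byte elements and add the counts."""
    elements = [(data >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]
    partial = [count_ones_16(e) for e in elements]   # one per sub-unit
    assert all(p < 2 ** 5 for p in partial)          # at most 16 each
    total = sum(partial)
    assert total < 2 ** 7                            # at most 64 in total
    return total
```

Each element holds at most 16 "1" bits, so a five-bit field is
enough for a partial count, and the total of at most 64 fits in
seven bits, in line with the description of the resultant data.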

In an embodiment of the present invention
population counting is performed by using a
multiplying unit, in a computer system, including a
plurality of multiplying sub-units for simultaneously
executing partial multiplications among elements
obtained by dividing multiplicand data and multiplier
data in a regular multiplication mode. In a
population counting mode, input data for the
population counting is divided into population
counting elements instead of the multiplier data and
population countings for the elements are performed
simultaneously by using the multiplying sub-units,
producing partial counted data of the population
counting elements, and the partial counted data are
sent to a pair comprising a carry save adder and
carry propagate adder by which a population counting
result for the input data is obtained and output.



Claims

1. A multiplying unit provided in a computer 
system, for performing multiplication of multipli- 
cand data and multiplier data in a multiplication 
mode and for performing population counting of
population counting input data in a population 
counting mode, said multiplying unit comprising: 
means for registering and equally dividing the
multiplicand data into a plurality of multiplicand
elements in the multiplication mode;
means for registering and equally dividing the mul- 
tiplier data into a plurality of multiplier elements in 
the multiplication mode, and for registering and 
equally dividing the population counting input data 
into a plurality of population counting elements, of 
the same number as said multiplier elements, in 
the population counting mode; 
a plurality of multiplying sub-units for executing 
simultaneously partial multiplications among the 
multiplicand elements and the multiplier elements, 
producing partial product data of the partial mul- 
tiplications in the multiplication mode, and for ex- 
ecuting simultaneously partial population countings 
for the population counting elements, producing 
partial counted data of the partial population coun- 
tings in the population counting mode; and 
means for adding said partial product data from 
said multiplying sub-units, outputting a multiplica- 
tion result of the multiplicand data and the multi- 
plier data in the multiplication mode, and for adding 
said partial counted data of the partial population 
countings from said multiplying sub-units, output- 
ting a population counting result of the population 
counting input data in the population counting 
mode. 

2. A multiplying unit according to claim 1, 
wherein a plurality of said multiplying sub-units 
comprise first type multiplying sub-units, of the 
same number as the population counting elements, 
each said first type multiplying sub-unit comprising: 
a first sub-unit register for setting one of the multi- 
plier elements in the multiplication mode, and for 
setting one of the population counting elements in 
the population counting mode; 
a second carry save adder for counting the number 
of "1" data bits in the population counting element
registered in said first sub-unit register, simulta- 
neously inputting every two data signals of the 
population counting element, said two data signals 
having been registered respectively in two register- 
ing positions adjacent each other, of said first sub- 
unit register, and outputting elemental sum and 
carry data of the population counting element; 
a second sub-unit register for setting and output- 
ting one of the multiplicand elements in the mul- 
tiplication mode, and for generating and outputting 
all "0" data bits in the population counting mode; 
a decoder for outputting decoded signals required 
for multiplying the multiplicand element and the 
multiplier element, inputting the multiplier element 
from said first sub-unit register in the multiplication 
mode, and for generating and outputting all "0" 
data bits in the population counting mode; 
a multiple generator for generating and outputting 
shifted data by combining the multiplicand element 
data from said second sub-unit register and the
decoded signals from said decoder in the
multiplication mode, and for generating and outputting
all "0" data bits in the population counting mode;
a pair of a first carry save adder and a first carry
propagate adder for outputting the partial product
data of the partial multiplication of the multiplicand
element and the multiplier element, adding the
shifted data from said multiple generator in the
multiplication mode, and for outputting the partial
counted data of the number of "1" data bits in the
population counting element, adding the elemental
sum and carry data from said second carry save
adder in the population counting mode; and
selector means for selecting the shifted data from
said multiple generator and the elemental sum and
carry data from said second carry save adder so
as to send either of said data to said pair of a first
carry save adder and a first carry propagate adder
in accordance with the multiplication mode and the
population counting mode respectively.

3. A multiplying unit according to claim 2, 
wherein a plurality of said multiplying sub-units 
further comprise second type multiplying sub-units 
having a number obtained by subtracting the number
of said first type multiplying sub-units from a
number required to perform the partial multiplication
among the multiplicand elements and the multiplier
elements, each said second type multiplying
sub-unit comprising:
a third sub-unit register for setting and outputting
one of the multiplier elements in the multiplication
mode;
a fourth sub-unit register for setting and outputting
one of the multiplicand elements in the multiplication
mode, and for generating and outputting all
"0" data bits in the population counting mode; and
a decoder for outputting decoded signals required
for multiplying the multiplicand element and the
multiplier element, inputting the multiplier element
from said third sub-unit register in the multiplication
mode, and for generating and outputting all "0"
data bits in the population mode.

4. A multiplying unit provided in a computer 
system, for performing multiplication of multiplicand
data and multiplier data in a multiplication
mode and for performing population counting of
population counting input data in a population
counting mode, said multiplying unit comprising:
means for registering the multiplicand data in the
multiplication mode;
means for registering and equally dividing the
multiplier data into a plurality of multiplier elements
in the multiplication mode, and for registering and
equally dividing the population counting input data
into a plurality of population counting elements, of
the same number as said multiplier elements, in
the population counting mode;
a plurality of multiplying sub-units for executing 
simultaneously partial multiplications among the
multiplicand and the multiplier elements, producing 
partial product data of the partial multiplications in 
the multiplication mode, and for executing simulta- 
neously partial population countings for the popula- 
tion counting elements, producing partial counted 
data of the partial population countings in the popu- 
lation counting mode; and 

means for adding said partial product data from 
said multiplying sub-units, outputting a multiplica- 
tion result of the multiplicand data and the multi- 
plier data in the multiplication mode, and for adding 
said partial counted data of the partial population 
countings from said multiplying sub-units, output- 
ting a population counting result of the population 
counting input data in the population counting 
mode. 

5. A multiplying unit according to claim 4, 
wherein a plurality of said multiplying sub-units 
comprise first type multiplying sub-units, of the 
same number as the population counting elements,
each said first type multiplying sub-unit comprising: 
a first sub-unit register for setting one of the multi- 
plier elements in the multiplication mode, and for 
setting one of the population counting elements in 
the population counting mode; 
a second carry save adder for counting the number 
of "1" data bits in the population counting element 
registered in said first sub-unit register, simulta- 
neously inputting every two data signals of the 
population counting element, said two data signals
having been registered respectively in
two registering positions adjacent each other, of
said first sub-unit register, and outputting elemental 
sum and carry data of the population counting 
element; 

a second sub-unit register for setting and output- 
ting the multiplicand data in the multiplication 
mode, and for generating and outputting all "0"
data bits in the population counting mode; 
a decoder for outputting decoded signals required 
for multiplying the multiplicand data and the
multiplier element, inputting the multiplier element from
said first sub-unit register in the multiplication 
mode, and for generating and outputting all "0" 
data bits in the population counting mode; 
a multiple generator for generating and outputting 
shifted data by combining the multiplicand data 
from said second sub-unit register and the de- 
coded signals from said decoder in the multiplica- 
tion mode, and for generating and outputting all 
"0" data bits in the population counting mode; 
a pair of a first carry save adder and a first carry 
propagate adder for outputting the partial product 
data of the partial multiplication of the multiplicand 
data and the multiplier element, adding the shifted
data from said multiple generator in the
multiplication mode, and for outputting the partial
counted data of the number of "1" data bits in the
population counting element, adding the elemental sum
and carry data from said second carry save adder
in the population counting mode; and
selector means for selecting the shifted data from
said multiple generator and the elemental sum and
carry data from said second carry save adder so
as to send either of said data to said pair of a first
carry save adder and a first carry propagate adder
in accordance with the multiplication mode and the
population counting mode respectively.

6. A multiplying unit according to claim 5,
wherein a plurality of said multiplying sub-units
further comprise second type multiplying sub-units
having a number obtained by subtracting the number
of said first type multiplying sub-units from a
number required to perform the partial multiplication
among the multiplicand and the multiplier
elements, each said second type multiplying sub-unit
comprising:
a third sub-unit register for setting and outputting
one of the multiplier elements in the multiplication
mode;
a fourth sub-unit register for setting and outputting
the multiplicand data in the multiplication mode,
and for generating and outputting all "0" data bits
in the population counting mode; and
a decoder for outputting decoded signals required
for multiplying the multiplicand data and the
multiplier element, inputting the multiplier element from
said third sub-unit register in the multiplication
mode, and for generating and outputting all "0"
data bits in the population mode.